Abstract

We define Policy Reuse as a learning technique guided by past policies, which poses the challenge
of balancing three choices: exploitation of the ongoing learned policy, exploration of random
actions, and exploration toward the past policies. In this work we introduce a new exploration
strategy, π-reuse, as an intelligent bias to reuse a past policy when learning a new one.
Interestingly, this strategy also provides a similarity metric among a set of past policies and the
new one. We therefore define a π-reuse based similarity metric between policies. We introduce a new
algorithm that combines the selection and reuse of past policies using this similarity metric. We
show empirical results that demonstrate the usefulness of our exploration strategy, π-reuse, as an
intelligent bias to reuse past policies, and also its effectiveness in defining similarity between
policies.


1  Introduction 


Policy  Reuse  can  be  defined  as  the  capability  of  integrating  past  action  policies  in  new  learning 
processes.  In  this  work,  the  motivation  of  Policy  Reuse  is  to  use  the  knowledge  acquired  to  solve 
different  tasks  when  learning  a  new  one  in  the  same  domain.  The  domain  defines  how  the  agent 
behaves  in  the  environment,  i.e.  the  state  transition  function;  each  different  task  in  the  same  domain 
is  characterized  through  its  reward  function. 

We introduce the reuse of past policies in Reinforcement Learning as an exploration bias during
a learning process. This remains a challenge, given that biasing the learning inherently
complicates the exploration strategy: in addition to the classical balance between exploring new
states and exploiting the current policy, it adds a third factor, exploiting the past policy.
However, this balance has been successfully found in other problems such as path planning,
where reusing waypoints from past plans has proven useful for solving new planning
problems [3].

In this work we introduce a new exploration strategy, called π-reuse, that integrates a past
policy in an ongoing learning process. This strategy assumes that a supervisor provides the action
policy used to bias the exploration. We demonstrate that the learning performance of the new
policy can be improved by biasing the exploratory process with the past policy, depending on
whether the policy provided by the supervisor solved a task which was “similar” to the new one or
not.

However, the application of Policy Reuse is much more complex if we receive a set of policies,
because then we need to select the most suitable one to bias the learning of the new task. To this
end, we exploit the capabilities of Policy Reuse to define a similarity metric between the
past policies and the new one. This similarity metric is based on the performance obtained when
following the π-reuse strategy to solve the new task with the different past policies. The higher the
performance, the higher the similarity.

The report is organized as follows. The next section summarizes related work, focusing on
exploration strategies and on policy reuse methods. Section 3 formalizes the concepts of task
and domain. Section 4 introduces the new exploration strategy, π-reuse. Section 5 describes the
experiments performed, whose results motivate the definition of the similarity metric presented in
Section 6. Section 7 discusses the main conclusions and further research.


2  Related  Work 

This  work  is  motivated  by  two  main  research  areas,  the  reuse  of  past  policies  and  exploration 
strategies.  Reusing  sub-policies  which  were  learned  for  a  different  but  related  task  can  be  used 
to  minimize  the  experience  required  to  solve  a  new  task.  For  instance  a  subproblem  of  an  MDP 
can  be  defined  as  a  new  MDP  where  the  state  space  is  a  subset  of  the  original  one.  Then,  the 
original  MDP  can  be  solved  reusing  policies  learned  for  different  subproblems  [2].  Intra-Option 
Learning  [9]  and  TTrees  [13]  also  reuse  macro-actions  to  learn  new  action  policies,  in  both  cases, 
in  Semi-Markov  Decision  Processes.  Hierarchical  RL  uses  different  abstraction  levels  to  organize 
subtasks  [5]. 

Some  methods  try  to  learn  environment  independent  knowledge  so  the  learned  knowledge  can 
be used for similar tasks in different scenarios [11]. Reusing the Q function that represents a
policy learned for a task can be useful if it is similar to the new one [4]. However, it requires the Q
function  to  be  available,  and  not  only  the  policy. 

Balancing exploration and exploitation is typically exemplified with the multi-armed bandit
problem [8], in which the learner must decide whether to explore new options or exploit the
knowledge already acquired [1]. Different kinds of exploration strategies can be found in the
literature. A random strategy always selects the action to execute at random, without using the
acquired knowledge. The ε-greedy strategy selects the best action suggested by the Q function with
probability ε, and a random action with probability (1 − ε). The Boltzmann strategy ranks the
actions, assigning a higher probability to actions with a higher Q value.
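
As an illustration, the three undirected strategies above can be sketched in Python. This is a
minimal sketch rather than the code used in the experiments: the function names, the list-based
Q values and the multiplicative form of the Boltzmann parameter `tau` are assumptions. It follows
the convention of this report, in which ε is the probability of acting greedily.

```python
import math
import random


def random_action(q_values, rng=random):
    """Random strategy: ignore the Q values entirely."""
    return rng.randrange(len(q_values))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Report's convention: greedy with probability epsilon, random otherwise."""
    if rng.random() < epsilon:
        best = max(q_values)
        # Break ties among equally good actions at random.
        return rng.choice([a for a, q in enumerate(q_values) if q == best])
    return rng.randrange(len(q_values))


def boltzmann_probabilities(q_values, tau):
    """Softmax weights exp(tau * Q). tau = 0 gives a uniform distribution.

    Using tau as a multiplier (rather than a divisor) is an assumption.
    """
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(tau * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, tau, rng=random):
    """Sample an action according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With `tau = 0` every action is equally likely, and larger values of `tau` concentrate the
probability on the actions with the highest Q value.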

Directed exploration strategies memorize exploration-specific knowledge that is used for guiding
the exploration search [10]. These strategies are based on heuristics that bias the learning so
that unexplored states tend to have a higher probability of being explored than recently visited
ones. However, most of them require a model of the domain (the state transition function) to
execute the heuristics.

Most of the previous examples focus only on exploration or on the reuse of sub-policies.
Instead, our work focuses on reusing a past policy as an exploration bias in the new learning
problem, and we investigate such exploration strategies.


3  Domains  and  Tasks 

A Markov Decision Process [7] is represented by a tuple $\langle \mathcal{S}, \mathcal{A}, \delta, \mathcal{R} \rangle$,
where $\mathcal{S}$ is the set of all possible states, $\mathcal{A}$ is the set of all possible actions, $\delta$ is an
unknown stochastic state transition function, $\delta: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \Re$, and $\mathcal{R}$ is an
unknown stochastic reward function, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \Re$.
We focus on RL domains where different tasks can be solved. We introduce a task as a specific
reward function, while the other concepts, $\mathcal{S}$, $\mathcal{A}$ and $\delta$, stay constant for all the tasks. Thus, we
extend the concept of an MDP by introducing two new concepts: domain and task. We characterize
a domain, $\mathcal{D}$, as a tuple $\langle \mathcal{S}, \mathcal{A}, \delta \rangle$. We define a task, $\Omega$, as a tuple
$\langle \mathcal{D}, \mathcal{R}_\Omega \rangle$, where $\mathcal{D}$ is a domain as defined before, and $\mathcal{R}_\Omega$ is the stochastic and unknown
reward function.

In this work we assume that we are solving a task with absorbing goal states. Thus, if $s_i$ is a
goal state, $\delta(s_i, a, s_i) = 1$, $\delta(s_i, a, s_j) = 0$ for $s_i \neq s_j$, and $\mathcal{R}(s_i, a) = 0$, for all
$a \in \mathcal{A}$. A trial starts by locating the learning agent in a random position in the environment.
Each trial finishes when a goal state is reached or when a maximum number of steps, say $H$, is
reached. Thus, the goal is to maximize the expected average reinforcement per trial, say $W$, as
defined in equation 1:

$$W = \frac{1}{K} \sum_{k=0}^{K} \sum_{h=0}^{H} \gamma^{h} r_{k,h} \qquad (1)$$

where $\gamma$ ($0 < \gamma < 1$) reduces the importance of future rewards, and $r_{k,h}$ defines the immediate
reward obtained in step $h$ of trial $k$, in a total of $K$ trials.
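
As a worked illustration of equation 1, the following Python sketch computes the gain from a
nested list of rewards. The function name `average_gain` and the list layout are assumptions made
for the example; steps are indexed from 0, so the first reward of a trial is not discounted.

```python
def average_gain(rewards, gamma):
    """Average discounted reward per trial.

    rewards[k][h] is the immediate reward obtained in step h of trial k.
    Trials may have different lengths (a trial stops when the goal is reached).
    """
    total = 0.0
    for trial in rewards:
        for h, r in enumerate(trial):
            total += (gamma ** h) * r
    return total / len(rewards)


# Two trials: the first reaches the goal (reward 1) on its third step,
# the second never reaches it.
example = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
gain = average_gain(example, gamma=0.95)  # (0.95 ** 2) / 2
```

Since the only non-zero reward is obtained when the goal is reached, trials that reach the goal in
fewer steps contribute more to the gain.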


4 An Exploratory Strategy for Policy Reuse

The  goal  of  this  work  is  to  describe  how  learning  can  be  helped  if  different  policies,  which  solve 
different  tasks,  are  used  in  the  learning  of  the  action  policy  of  another  similar  task.  But  first,  we 
need  to  describe  how  only  one  past  policy  biases  the  learning  of  the  new  one. 

4.1  Scope 

We define an action policy, $\Pi$, as a function $\Pi: \mathcal{S} \to \mathcal{A}$. If the action policy was created to
solve a defined task, $\Omega$, the action policy is called $\Pi_\Omega$. The gain, or average expected reward,
received when executing an action policy $\Pi$ in the task $\Omega$ is called $W_\Omega^{\Pi}$. Lastly, an optimal
action policy for solving the task $\Omega$ is called $\Pi^*_\Omega$. The scope of this section is then the
following:

• We need to solve the task $\Omega$, i.e. learn $\Pi^*_\Omega$.

• We have previously solved the set of tasks $\{\Omega_1, \ldots, \Omega_n\}$, so we have their respective
optimal policies, $\{\Pi^*_{\Omega_1}, \ldots, \Pi^*_{\Omega_n}\}$.

• Let us assume that there is a supervisor who, given $\Omega$, tells us which is the most similar task,
$\Omega_i$, to $\Omega$. Thus, we know that the policy to reuse is $\Pi^*_{\Omega_i}$.

Thus, in this section we assume that there exists a supervisor who provides a policy that solves
a task similar to the one that we are trying to solve. A discussion on how similarities between
tasks and their respective policies can be computed, and how to automatically estimate the policy
to reuse, will be introduced in Section 6.

4.2 The π-reuse Exploration Strategy

We denote the old policy by $\Pi_{old}$, and the one we are currently learning by $\Pi_{new}$. We assume
that we are using a direct RL method to learn the action policy, so we are learning its related Q
function. Any algorithm can be used to learn the Q function, with the only requirement that it can
learn off-policy, i.e. it can learn a policy while executing a different one, as Q-Learning does [14].

The goal of the π-reuse strategy is to balance random exploration, exploitation of the old policy,
and exploitation of the new policy, which is currently being learned. The π-reuse strategy follows
the past policy with probability $\psi$, and with probability $1 - \psi$ it exploits the new
policy. Random exploration is always required, so when exploiting the new policy, it
follows an ε-greedy strategy, as defined in Table 1. Lastly, the $\upsilon$ parameter decays the
value of $\psi$ at each step within a trial.

Thus, there are three probabilities involved: the probability of exploiting the past policy, the
probability of using the current policy, and the probability of acting randomly. These probabilities
are shown in Figure 1, for input values of $H = 100$, $\psi = 1$ and $\upsilon = 0.95$. In this case the
$\epsilon$ parameter is set in each step to $1 - \psi_h$.
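
The three probabilities can be computed step by step, as in the following Python sketch. It
assumes the convention used in this report, in which ε is the probability of the greedy choice
inside the ε-greedy branch, and that $\psi_h$ is multiplied by $\upsilon$ after every step as in Table 1;
the function name is hypothetical.

```python
def pi_reuse_probabilities(H, psi, upsilon):
    """Per-step probabilities (past policy, new policy, random action)."""
    rows = []
    psi_h = psi
    for _ in range(H):
        epsilon = 1.0 - psi_h                        # greedy probability of the epsilon-greedy branch
        p_past = psi_h                               # follow the past policy
        p_new = (1.0 - psi_h) * epsilon              # exploit the policy being learned
        p_random = (1.0 - psi_h) * (1.0 - epsilon)   # act randomly
        rows.append((p_past, p_new, p_random))
        psi_h *= upsilon                             # decay within the trial
    return rows


rows = pi_reuse_probabilities(H=100, psi=1.0, upsilon=0.95)
```

Under these assumptions the probability of acting randomly is $\psi_h (1 - \psi_h)$, which never
exceeds 0.25 and peaks when $\psi_h$ is close to 0.5, while the probability of exploiting the new
policy, $(1 - \psi_h)^2$, dominates in the final steps of the trial.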

The figure shows that in the initial steps of each trial, the past policy is exploited. As the
number of steps increases, exploration also increases, while in the final steps of the trial, the new
policy is exploited. The transition from exploiting the past policy to exploiting the new one
depends on the $\upsilon$ parameter. If this parameter is low, the transition occurs in the initial steps,
while if it is high, the transition is delayed.

π-reuse ($\Pi_{old}$, $K$, $H$, $\psi$, $\upsilon$)

for $k = 1$ to $K$
    Set the initial state, $s$, randomly.
    Set $\psi_1 \leftarrow \psi$
    for $h = 1$ to $H$
        With a probability of $\psi_h$, $a = \Pi_{old}(s)$
        With a probability of $1 - \psi_h$, $a =$ ε-greedy($\Pi_{new}(s)$)
        Receive current state $s'$, and reward, $r_{k,h}$
        Update $Q^{\Pi_{new}}(s, a)$, and therefore, $\Pi_{new}$
        Set $\psi_{h+1} \leftarrow \psi_h \upsilon$
        Set $s \leftarrow s'$

$W = \frac{1}{K} \sum_{k=0}^{K} \sum_{h=0}^{H} \gamma^{h} r_{k,h}$

Return $W$ and $\Pi_{new}$


Table 1: π-reuse Exploration Strategy.
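
The strategy of Table 1 can be sketched in Python with tabular Q-Learning. This is an
illustration under stated assumptions, not the implementation used in the experiments: the
environment interface (`reset()` and `step(action)`), the toy `Corridor` domain, the choice
$\epsilon = 1 - \psi_h$, the discount exponent starting at 0, and ending a trial as soon as the goal is
reached are all choices made for the example.

```python
import random
from collections import defaultdict


def pi_reuse(env, past_policy, K, H, psi, upsilon,
             gamma=0.95, alpha=0.05, n_actions=4, q=None, rng=None):
    """Return (W, Q) after K trials of at most H steps each.

    env is a hypothetical object with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    rng = rng or random.Random()
    q = q if q is not None else defaultdict(lambda: [0.0] * n_actions)
    total = 0.0
    for _ in range(K):
        s = env.reset()
        psi_h = psi                                   # psi is reset at every trial
        for h in range(H):
            if rng.random() < psi_h:
                a = past_policy(s)                    # exploit the past policy
            else:
                epsilon = 1.0 - psi_h                 # greedy probability (assumed setting)
                if rng.random() < epsilon:
                    best = max(q[s])
                    a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
                else:
                    a = rng.randrange(n_actions)      # random exploration
            s_next, r, done = env.step(a)
            # Off-policy Q-Learning update of the new policy's Q function.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            total += (gamma ** h) * r
            psi_h *= upsilon
            s = s_next
            if done:
                break
    return total / K, q


class Corridor:
    """Toy deterministic domain (hypothetical): states 0..n-1, goal at n-1."""

    def __init__(self, n=6):
        self.n = n
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        # Action 1 moves right; any other action moves left.
        self.s = min(self.n - 1, self.s + 1) if action == 1 else max(0, self.s - 1)
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done
```

In this toy corridor, calling `pi_reuse` with a past policy that always moves right should yield a
clearly higher gain than calling it with one that always moves left, since only the former leads
the agent toward the goal during the first steps of each trial.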


5  Experiments 

In this section, we describe the experiments performed to demonstrate the usefulness of the
exploration strategy defined above. But first, we describe the domain used.

5.1  Tasks  in  a  Robot  Navigation  Domain 

This domain consists of a robot moving inside an office area, as shown in Figure 2, similar to the
one used in other RL works [6, 12]. The environment is represented by walls, free positions and
goal areas, all of them of size 1 × 1. The whole domain is N × M (24 × 21 in this case). The possible
actions that the robot can execute are “North”, “East”, “South” and “West”, all of size one. The
final position after each action is perturbed by a random variable following a uniform distribution
in the range (−0.20, 0.20). The robot knows its location in the space through continuous coordinates
(x, y) provided by some localization system. In this work, we assume that we have the optimal
uniform discretization of the state space (which consists of 24 × 21 regions). Furthermore, the
robot has an obstacle avoidance system that blocks the execution of actions that would crash it into
a wall. The goal in this domain is to reach the area marked with ‘G’. When the robot reaches it, the
trial is considered successful, and the robot receives a reward of 1. Otherwise, it receives a
reward of 0.
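
The dynamics described above can be sketched as a single transition function. The sketch below
is only illustrative and makes several assumptions that the text does not state: the noise is
applied independently to each coordinate, “North” increases y, a blocked action leaves the robot
where it was, leaving the grid is treated like hitting a wall, and `walls` is a hypothetical set
of wall cells.

```python
import random

# Unit displacements for the four actions (the orientation is an assumption).
ACTIONS = {
    "North": (0.0, 1.0),
    "East": (1.0, 0.0),
    "South": (0.0, -1.0),
    "West": (-1.0, 0.0),
}


def cell_of(x, y):
    """Uniform discretization of the continuous position into 1 x 1 regions."""
    return int(x), int(y)


def step(pos, action, walls, goal, width=24, height=21, noise=0.20, rng=random):
    """Return (new_position, reward, done) for one noisy move."""
    dx, dy = ACTIONS[action]
    x = pos[0] + dx + rng.uniform(-noise, noise)
    y = pos[1] + dy + rng.uniform(-noise, noise)
    inside = 0.0 <= x < width and 0.0 <= y < height
    if not inside or cell_of(x, y) in walls:
        # Obstacle avoidance: the action is blocked and the robot stays in place.
        return pos, 0.0, False
    reached = cell_of(x, y) == goal
    return (x, y), (1.0 if reached else 0.0), reached
```

The learner only sees the discretized state `cell_of(x, y)`, while the simulator keeps the
continuous position.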

Figure 2 shows five different tasks in the same domain, $\Omega_1$, $\Omega_2$, $\Omega_3$, $\Omega_4$ and $\Omega$, which differ
because their goal states, and therefore their reward functions, are different. Biasing the learning
of $\Pi$ (to solve $\Omega$) using $\Pi_1$ (the policy that solves $\Omega_1$) seems to be useful given that both
policies could be equal for a large number of states. However, which states share the same policy
and which do not is completely unknown a priori, given that both the reward function and the state
transition function are unknown.

Figure 1: Evolution of the probabilities of exploring and exploiting in a trial for the π-reuse
exploration strategy.


5.2  Description  of  the  Learning  Curves 

In the following subsections, we describe the experimental results of applying different
exploration strategies for learning the task $\Omega$, shown in Figure 2(e). For each of these strategies
(and parameter settings), we present two results showing two different curves, the learning curve
and the test curve.

The learning curve of each strategy describes the performance of that strategy in the learning
process. Learning has been performed using the Q-Learning algorithm, with fixed parameters
$\gamma = 0.95$ and $\alpha = 0.05$, which proved empirically to work well for learning.

A learning process consists of executing $K = 2000$ trials. Each trial consists of following the
defined strategy until the goal is reached or until the maximum number of steps, $H = 100$, has
been executed. In the figures containing the curves, the x axis shows the trial number. The y axis
represents the gain obtained. Thus, a value of 0.2 for trial 200 means that the average gain
obtained in the first 200 trials has been 0.2.
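
Read this way, each point of a learning curve is a running average. The following Python sketch
illustrates the computation; it assumes that the gain of a single trial is its discounted reward
as in equation 1, and the function name is hypothetical.

```python
def learning_curve(trial_gains):
    """Point k of the curve is the mean gain over the first k trials."""
    curve = []
    running = 0.0
    for k, gain in enumerate(trial_gains, start=1):
        running += gain
        curve.append(running / k)
    return curve


# Four trials with per-trial gains 0, 1, 0.5 and 0.5.
curve = learning_curve([0.0, 1.0, 0.5, 0.5])
```

Because every point averages over all previous trials, poor early trials keep pulling the curve
down long after the policy has improved.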

The test curve represents the evolution of the performance of the policy while it is being
learned. Every 100 trials of the learning process, the Q function learned up to that moment is
stored. Thus, after the learning process, we can test all those policies. Each test consists of 1000
trials where the robot follows a completely greedy strategy. Thus, the x axis shows the learning
trial in which that policy was generated, and the y axis shows the result of the test, measured as
the average number of steps executed to achieve the goal in the 1000 test trials.

For  both  the  learning  and  test  curves,  the  results  provided  are  the  average  of  ten  executions.  In 
the  curves,  error  bars  provide  the  standard  deviation  in  the  ten  executions. 

5.3  Learning  from  Scratch 

We want to learn the task described in Figure 2(e). For comparison, the learning and test
processes were first executed following different exploratory strategies that do not use any
past policy. Specifically, we used four different strategies. The first one is a random strategy.
The second one is a completely greedy strategy. The third one is ε-greedy, with an initial value of
$\epsilon = 0$, which is incremented by 0.0005 in each trial. Lastly, the Boltzmann strategy was used,
initializing $\tau = 0$ and increasing it by 5 in each learning trial. Figure 3 shows the learning and
test curves for all of them.

Figure 2: Office Domain. Panels include (d) Task $\Omega_4$ and (e) Task $\Omega$.


Figure 3: Learning and test evolution when learning from scratch. Panels: (a) Learning Curve;
(b) Test Curve. Legend: Random, 1-greedy, ε-greedy, Boltzmann.

Figure 3(a) shows the learning curve. We see that when acting randomly, the average gain
in learning is almost 0, given that acting randomly is a very poor strategy. However, when a
greedy behavior is introduced (strategy 1-greedy), the curve shows a slow increase, reaching
values of almost 0.1. The problem with the 1-greedy strategy is that it also produces a very high
standard deviation in the 10 executions performed, showing that a completely greedy strategy may
produce very different results. The curve obtained by the Boltzmann strategy does not offer any
improvement. However, the ε-greedy strategy seems to compute an accurate policy in the initial
trials, and obtains the highest average gain at the end of the learning.

The random strategy and ε-greedy outperform the other strategies in the test curve shown in
Figure 3(b). This is due to the fact that both strategies, with the defined parameters, are less
greedy than the other strategies in the initial steps. Typically, higher exploration at the
beginning results in more accurate policies.

5.4 Reusing the Past Policy Following π-reuse

We want to learn to solve the task $\Omega$, defined in Figure 2(e). To do this, we need to learn the
action policy, $\Pi_\Omega$, that maximizes $W_\Omega^{\Pi_\Omega}$, as defined in equation 1. In this case, we assume that
a supervisor provides a similar task, say $\Omega_s$, and the exploration strategy π-reuse is used to
learn the new action policy.

Figure 4(a) shows the learning curves of different learning processes. In each of them, a
different policy has been reused. As in the previous experiments, the parameters used are
$\gamma = 0.95$ and $\alpha = 0.05$ in the Q-Learning update function, and $\psi = 1$ and $\upsilon = 0.95$, which
proved empirically to work well. We distinguish three different cases. In the first one, the task
previously learned is $\Omega_4$ ($\Omega_s = \Omega_4$), whose goal is in the same room as the goal of $\Omega$. In the
second case, $\Omega_s = \Omega_1$, so the goals are in different rooms. However, their optimal policies could
be the same for the whole domain except for the rooms where the respective goals are located. In the
last two cases, $\Omega_s = \Omega_2$ and $\Omega_3$ respectively, which are very different from $\Omega$.


Figure 4: Learning and test evolution when following the exploration strategy π-reuse. Panels:
(a) Learning Curve; (b) Test Curve. Legend: learning from $\Pi_1$, $\Pi_2$, $\Pi_3$ and $\Pi_4$.

Figure 4(a) shows how, when biasing the exploration process for learning the task $\Omega$ with
the policies $\Pi_1$ and $\Pi_4$, the obtained gain increases dramatically within the first few trials of
the execution. For instance, when reusing $\Pi_1$, in only 100 iterations the average gain is higher
than 0.15, and after 400 iterations the value stays around 0.2. When reusing $\Pi_4$, the gain is higher
than 0.1 after only 200 trials, and after 500 trials it stays around 0.15. In both cases, the standard
deviation is high in the initial trials, but it approaches 0 in subsequent trials. The behavior of
the test curves is also very good in both cases, showing that in only 400 iterations, a gain higher
than 0.3 is obtained with a very low deviation. These results demonstrate that reusing similar past
policies produces a significant improvement over exploration strategies that learn from scratch.

However, when the learning is biased with a very different policy, such as $\Pi_2$ and $\Pi_3$, the
average gain shown in Figure 4(a) is below 0.05, so the learning process is even worse than when
learning from scratch. Their test curves behave better. In both cases there is an inflection in the
test curve, which reaches, at the end of the 2000 trials, a performance similar to that of the
unbiased strategies. The inflection is due to the learning of an initial path to the goal. However,
in this case the standard deviation is very high, showing that the inflection may occur at very
different moments of the learning process.


6  Similarity  between  Policies 

Previous results show that reusing a past policy provides a bias in the exploration process which
speeds up the learning. The improvement depends on whether the reused policy solves a task
similar to the one we are currently learning. However, that is not the only benefit of policy reuse.
One interesting observation about the results in Figure 4(a) is that the learning curves provide us
with a very useful metric of similarity between policies. In that figure, the gain obtained for each
of the past policies can be understood as: (i) an estimation of how similar the policy reused is to
the current one; and (ii) an estimation of how useful the policy reused is in order to learn the new
policy. Actually, the gain obtained by each one could be used to rank the similarity of the past
policies with respect to the new one. In this case, the most similar is $\Pi_4$, followed by $\Pi_1$, $\Pi_2$
and $\Pi_3$.

Furthermore, the estimations above can be computed very fast, as Figure 5 demonstrates. The
figure zooms in on the initial 100 trials of Figure 4(a). The figure shows that in only 25 trials, the
gain of reusing the policy $\Pi_4$ significantly outperforms the gain of reusing the other policies.
Thus, in a total of 100 trials (25 for each policy), the most similar policy, and therefore the best
policy to reuse, can be computed. These ideas are formalized next.


Figure 5: Computation of the Similarity Among Policies. The x axis shows Trials (10 to 100).
Legend: learning from $\Pi_1$, $\Pi_2$, $\Pi_3$ and $\Pi_4$.


6.1  A  Similarity  Metric  Between  Policies 

As introduced above, we call $\Pi^*_\Omega$ the optimal action policy for solving the task $\Omega$. To be optimal
means that $W_\Omega^{\Pi^*_\Omega} \geq W_\Omega^{\Pi}$, for all policies $\Pi$ in the space of all possible policies. Then, the
following theorem can be derived.

Theorem: Given two different tasks, $\Omega_i$ and $\Omega_j$, and their optimal policies, $\Pi^*_{\Omega_i}$ and $\Pi^*_{\Omega_j}$,
respectively, then:

$$W_{\Omega_i}^{\Pi^*_{\Omega_i}} \geq W_{\Omega_i}^{\Pi^*_{\Omega_j}} \qquad (2)$$

It is easy to demonstrate this theorem, given that $\Pi^*_{\Omega_i}$ is the optimal policy for the task that we
are involved in now, i.e., $\Omega_i$. That ensures the maximum expected reward will be received, given
that:

$$V^{\Pi^*_{\Omega_i}}(s) \geq \mathcal{R}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \delta(s, a, s') V^{\Pi^*_{\Omega_i}}(s'), \quad \text{for } a = \arg\max_{a'} Q^{\Pi^*_{\Omega_j}}(s, a')$$

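The previous theorem ensures that $W_{\Omega_i}^{\Pi^*_{\Omega_i}} - W_{\Omega_i}^{\Pi^*_{\Omega_j}} \geq 0$. Therefore, we can define how useful
the policy $\Pi^*_{\Omega_j}$ could be in learning the policy $\Pi_{\Omega_i}$, using the distance metric shown in
equation 3.

$$d(\Pi_i, \Pi_j) = W_{\Omega_i}^{\Pi^*_{\Omega_i}} - W_{\Omega_i}^{\Pi^*_{\Omega_j}} \qquad (3)$$

In this case the distance metric is not symmetric, so $d(\Pi_i, \Pi_j)$ could be different from
$d(\Pi_j, \Pi_i)$. Then, the most useful policy to reuse when learning the task $\Omega$ is:

$$\arg\min_{\Pi_i} \left( W_{\Omega}^{\Pi^*_{\Omega}} - W_{\Omega}^{\Pi^*_{\Omega_i}} \right), \quad i = 1, \ldots, n \qquad (4)$$

However, $W_{\Omega}^{\Pi^*_{\Omega}}$ is independent of $i$, so the previous equation is equivalent to:

$$\arg\max_{\Pi_i} \left( W_{\Omega}^{\Pi^*_{\Omega_i}} \right), \quad i = 1, \ldots, n \qquad (5)$$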

This equation is not possible to compute, given that $W_{\Omega}^{\Pi^*_{\Omega_i}}$ is unknown a priori. Furthermore,
if we follow the policy $\Pi^*_{\Omega_i}$ greedily, the task $\Omega$ will probably never be solved. However, if
instead of following $\Pi^*_{\Omega_i}$ greedily, we reuse it following the π-reuse exploration strategy, we can
compute the gain of reusing $\Pi^*_{\Omega_i}$ to solve $\Omega$. In this sense, all the past policies could be reused,
computing their respective gains, until an accurate estimation is obtained. Then, past policies
with a lower gain are discarded, and the one with the highest gain is used in the π-reuse
exploration strategy. The next section describes a simple algorithm that applies these ideas.

6.2  An  Algorithm  for  Policy  Reuse 

A  basic  algorithm  for  policy  reuse  from  a  set  of  policies  requires  the  following  two  steps: 

• Obtain the most similar policy, $\Pi_s$. To do this, it is necessary (i) to compute the gain obtained
when following the π-reuse exploration strategy with each of the past policies; and (ii) to
choose the policy with the highest gain. We call $K_s$ the number of trials used to select the
most similar policy.

• Learn a new action policy, $\Pi_\Omega$. $\Pi_s$ is used in the π-reuse exploration strategy to learn a new
action policy. We call $K_r$ the number of trials used to learn the new policy.

The previous steps are formalized in Table 2, where we assume that π-reuse is a method that
we can call with the parameters defined in Table 1.

Policy Reuse Algorithm

• Given:

    1. A set of $n$ tasks $\{\Omega_1, \ldots, \Omega_n\}$.
    2. Their respective optimal policies, $\{\Pi^*_{\Omega_1}, \ldots, \Pi^*_{\Omega_n}\}$, to solve them.
    3. A new task, $\Omega$, that we want to solve.
    4. A maximum number of steps per trial, $H$.
    5. A maximum number of trials to execute, $K$.
    6. The tuple of integer values, $\langle K_s, K_r \rangle$, such that $K = K_s + K_r$.
    7. The parameters $\psi$ and $\upsilon$ used in the exploration strategy π-reuse.

• for $i = 1$ to $n$ do

    1. Execute π-reuse($\Pi^*_{\Omega_i}$, $K_s/n$, $H$, $\psi$, $\upsilon$).
    2. Obtain the associated gain, $W_i$.

• Set $\Pi_s = \arg\max_{\Pi^*_{\Omega_i}} W_i$

• Learn $\Pi_\Omega$ by calling π-reuse($\Pi_s$, $K_r$, $H$, $\psi$, $\upsilon$)


Table 2: An Algorithm for Policy Reuse
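
The two steps of Table 2 can be sketched in Python as follows. The sketch assumes a hypothetical
callable `run_pi_reuse(policy, K, H, psi, upsilon)` that wraps the strategy of Table 1 and
returns a pair (gain, learned policy); the function name `policy_reuse`, the integer division of
$K_s$ by $n$ and the returned tuple are choices made for the example.

```python
def policy_reuse(past_policies, run_pi_reuse, K_s, K_r, H, psi, upsilon):
    """Select the most similar past policy, then reuse it to learn the new task.

    run_pi_reuse(policy, K, H, psi, upsilon) -> (gain, new_policy) is a
    hypothetical callable implementing the pi-reuse strategy.
    """
    n = len(past_policies)
    trials_each = K_s // n                      # K_s / n trials per candidate
    # Step 1: estimate the gain of reusing each past policy.
    gains = [run_pi_reuse(p, trials_each, H, psi, upsilon)[0] for p in past_policies]
    # Step 2: keep the candidate with the highest estimated gain.
    best = max(range(n), key=gains.__getitem__)
    # Step 3: learn the new policy by reusing the selected one for K_r trials.
    _, new_policy = run_pi_reuse(past_policies[best], K_r, H, psi, upsilon)
    return best, gains, new_policy
```

As in Table 2, the final learning phase starts from scratch: the experience gathered while
estimating the gains is used only to select the policy to reuse.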

6.3  Empirical  Results 

Figure 6 shows the learning curve obtained when the Policy Reuse algorithm is executed for two
sets of parameters, $K_s = 100$ and $K_r = 1900$, and $K_s = 400$ and $K_r = 1600$ respectively. This
learning curve demonstrates that, even when the similar policy must be computed, policy reuse
can be very useful to bias the exploration of a learning process, providing better performance than
learning from scratch. The test curves correspond to the one shown in Figure 4(b) when reusing
$\Pi_4$, but delayed by $K_s$ steps.

However, the success of policy reuse, when it is applied to speed up learning, depends on several
factors. For instance, it requires the definition of the values of $K_s$ and $K_r$ which, in turn,
could depend on the domain, the task, and the number of past policies available. Furthermore,
it does not make the most of the experience obtained; for instance, the experience used in the
computation of the most similar policy could also be used to learn the new policy. Thus, more
refined algorithms should reduce the number of parameters used, and could improve on these results.


7  Conclusions  and  Further  Research 

Figure 6: Learning Curve when following the Policy Reuse algorithm.

In this report, we have described Policy Reuse as an exploration bias that balances the exploration
of random actions, the exploitation of the ongoing learned policy, and the exploration toward a
past policy. We have instantiated the concept of Policy Reuse by defining a new exploration
strategy, π-reuse. This new strategy successfully achieves this balance, and we have demonstrated
that it can improve on the learning performance obtained when learning from scratch with different
strategies.

Furthermore, we have demonstrated that Policy Reuse provides a similarity metric between
policies. Such a metric makes it possible to identify, from a set of past policies, the one that is
most similar to the policy we are currently learning.

Policy Reuse and the concept of similarity introduced in this work open a wide range of
challenging research lines, including learning across domains or agents and the ability to scale
RL considerably in complexity.